DNA was extracted from dermal tissue using Mag-Bind Blood and Tissue kits.
For assembly of a reduced representation reference, double digest restriction-site associated DNA (ddRAD) digestion with EcoRI and SphI was used to create a single library consisting of 24 individuals from all sampled estuaries, which was sequenced on a single lane of an Illumina MiSeq DNA sequencer (paired-end, 300 bp reads). Using the longer reads produced by the MiSeq for reference assembly increases efficiency during read mapping and SNP calling downstream. Raw reads were demultiplexed using process_radtags (Catchen et al. 2013 [Catchen, Julian, Paul A. Hohenlohe, Susan Bassham, Angel Amores, and William A. Cresko. 2013. “Stacks: An analysis tool set for population genomics.” Molecular Ecology 22 (11): 3124–40. https://doi.org/10.1111/mec.12354]) and reference contiguous sequence alignments (contigs) were reconstructed using the overlapping read assembly option in the dDocent pipeline (Puritz, Hollenbeck, and Gold 2014 [Puritz, Jonathan B, Christopher M Hollenbeck, and John R Gold. 2014. “dDocent : a RADseq, variant-calling pipeline designed for population genomics of non-model organisms.” PeerJ 2: e431. https://doi.org/10.7717/peerj.431]). Reference assembly was run for a range of combinations of threshold values for K1 (minimum within-individual coverage per read), K2 (number of individuals a read must occur in), and c (minimum percent similarity to cluster reads). A test data set (a subset of the HiSeq data set described below) was mapped to each reference to identify the optimum reference by maximizing the number of reads mapped and minimizing the number of read pairs mapped to two different contigs. Final parameters selected were K1 = 5, K2 = 6, and c = 0.8. Detailed steps can be found in the 01 Reference_Construction.html-notebook.
For genotyping, two ddRAD libraries were constructed and sequenced on two separate lanes of an Illumina HiSeq 4000. Raw sequences were demultiplexed using process_radtags (Catchen et al. 2013). Quality trimming, read mapping to the reduced representation reference, and SNP calling were performed using the dDocent pipeline (Puritz, Hollenbeck, and Gold 2014). Raw SNPs were filtered using VCFtools (Danecek et al. 2011 [Danecek, Petr, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, et al. 2011. “The variant call format and VCFtools.” Bioinformatics 27 (15): 2156–8. https://doi.org/10.1093/bioinformatics/btr330]) and custom scripts following O’Leary et al. (2018 [O’Leary, Shannon J, Jonathan B Puritz, Stuart C Willis, Christopher M Hollenbeck, and David S Portnoy. 2018. “These aren’t the loci you’re looking for: Principles of effective SNP filtering for molecular ecologists.” Wiley/Blackwell (10.1111). https://doi.org/10.1111/mec.14792]), setting thresholds for a minimum sequence and genotype quality of 20, a minimum genotype call rate per locus by estuary of 90%, a minor allele count of three, a minimum genotype depth of five, a minimum mean depth of 15, and a maximum mean depth of 180. Individuals with > 10% missing data were removed. SNPs were further filtered based on allele balance, quality/depth ratio, mapping quality ratio of reference/alternate alleles, properly paired status, strand representation, and variance in depth.
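The locus-level thresholds listed above can be illustrated with a small R sketch. This is not the filtering script used for the analysis: the summary table, its column names, and its values are hypothetical, the call rate is treated as a single overall value rather than per estuary, and the genotype-level filters (genotype quality and genotype depth) are left out.

```r
# Sketch only: apply locus-level thresholds to a per-SNP summary table.
# Column names and values are hypothetical; thresholds are those given in the text.
snps <- data.frame(
  id         = paste0("snp_", 1:6),
  qual       = c(35, 18, 60, 42, 25, 90),              # site quality
  call_rate  = c(0.95, 0.99, 0.85, 0.92, 0.97, 0.93),  # proportion of genotypes called
  mac        = c(12, 40, 7, 2, 5, 30),                 # minor allele count
  mean_depth = c(40, 55, 22, 30, 12, 210),             # mean depth across individuals
  stringsAsFactors = FALSE
)

# A SNP is retained only if it passes every threshold
passes <- with(snps,
  qual >= 20 & call_rate >= 0.90 & mac >= 3 &
    mean_depth >= 15 & mean_depth <= 180
)

kept_ids <- snps$id[passes]
print(kept_ids)
```

Each of the simulated SNPs 2 to 6 fails exactly one threshold, so the effect of every filter can be checked by inspecting the table.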
Finally, SNPs on the same contig were collapsed into haplotypes using rad_haplotyper (Willis et al. 2017 [Willis, Stuart C., Christopher M. Hollenbeck, Jonathan B. Puritz, John R. Gold, and David S. Portnoy. 2017. “Haplotyping RAD loci: an efficient method to filter paralogs and account for physical linkage.” Molecular Ecology Resources, February. https://doi.org/10.1111/1755-0998.12647]), producing a final data set consisting of SNP-containing loci (hereafter ‘loci’) for data analysis. In addition, rad_haplotyper flags loci exhibiting patterns indicative of paralogs or of genotyping error due to low coverage; flagged loci were removed from the final data set. Detailed filtering steps and sequentially applied thresholds are found in 02 Genotyping.html. After haplotyping, loci with a global major allele frequency > 95% were removed.
A total of 207 individuals were genotyped for 14688 loci (39129 alleles).
A total of 207 YOY and juveniles were genotyped to assess genetic heterogeneity and population differentiation.
To test for genetic heterogeneity and population structure, YOY and juveniles were grouped by natal estuary and by regions informed by vertebral chemistry analysis, defined as South (Corpus Christi Bay, San Antonio Bay, Aransas Bay), Central (Matagorda Bay), and North (Galveston Bay, Sabine Lake). Regional groupings were based on similarities in hydrological characteristics (e.g. temperature, salinity, sources of freshwater input), which are expected to produce distinctive signatures in vertebral chemistry. Exploratory analysis of genetic data indicated that Matagorda Bay might be more appropriately grouped with the North region; therefore, a second grouping of individuals into South and North regions was also tested.
Table 1a: Sample size per estuary (east to west - SL: Sabine Lake, GAL: Galveston Bay, MAT: Matagorda Bay, SA: San Antonio Bay, ARA/CC: Aransas & Corpus Christi Bay).
| POP | n |
|---|---|
| SL | 37 |
| GAL | 56 |
| MAT | 29 |
| SA | 8 |
| ARA/CC | 77 |
Table 1b: Sample size per region (North: SL, GAL; Central: MAT; South: SA, ARA/CC).
| REGION | n |
|---|---|
| North | 93 |
| Central | 29 |
| South | 85 |
Table 1c: Sample size per region (North: SL, GAL, MAT; South: SA, ARA/CC).
| REGION2 | n |
|---|---|
| North | 122 |
| South | 85 |
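The two regional groupings are simply alternative lookups from estuary to region. As a minimal sketch, the following R code rebuilds the regional sample sizes in Tables 1b and 1c from the estuary counts in Table 1a; the object names are illustrative and not taken from the analysis scripts.

```r
# Sample sizes per estuary as listed in Table 1a
n_estuary <- c(SL = 37, GAL = 56, MAT = 29, SA = 8, "ARA/CC" = 77)

# Two alternative estuary-to-region lookups (names are illustrative)
region  <- c(SL = "North", GAL = "North", MAT = "Central",
             SA = "South", "ARA/CC" = "South")
region2 <- c(SL = "North", GAL = "North", MAT = "North",
             SA = "South", "ARA/CC" = "South")

# Sum estuary counts within each region
n_region  <- tapply(n_estuary, region[names(n_estuary)], sum)
n_region2 <- tapply(n_estuary, region2[names(n_estuary)], sum)

print(n_region)
print(n_region2)
```

Keeping the grouping as a lookup vector means the same genotype table can be re-labelled for either regional scheme without changing the underlying data.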
Confidence intervals may reveal biological trends in the data more reliably than p-values, which require a null hypothesis (e.g. Fst = 0) against which to permute, and they can be more informative when sample sizes are small.
To test for genetic heterogeneity across estuaries, global FST (Weir and Cockerham 1984 [Weir, B. S., and C. Clark Cockerham. 1984. “Estimating F-Statistics for the Analysis of Population Structure.” Evolution 38 (6). Society for the Study of Evolution: 1358. https://doi.org/10.2307/2408641]) was calculated, a 95% confidence interval (CI) was determined by bootstrapping (1,000 iterations, loci sampled with replacement), and permuted p-values were calculated (1,000 iterations) using functions implemented in hierfstat (Goudet 2005 [Goudet, Jérôme. 2005. “HIERFSTAT, a package for R to compute and test hierarchical F-statistics.” Molecular Ecology Notes 5 (1). Wiley/Blackwell (10.1111): 184–86. https://doi.org/10.1111/j.1471-8286.2004.00828.x]) and assigner (Gosselin, Anderson, and Bradbury 2016 [Gosselin, T, Eric C Anderson, and Ian R Bradbury. 2016. “assigner: Assignment Analysis with GBS/RAD Data using R.” R Package. https://doi.org/10.5281/zenodo.51453]). Similarly, pairwise FST, 95% CIs, and permuted p-values were calculated to test for significant differences between pairs of estuaries and regions. Finally, the distribution of FST per locus was assessed to identify the most informative loci.
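Bootstrapping over loci can be sketched in a few lines of base R. The example below assumes that the multi-locus estimate is a ratio of sums of per-locus variance components, as in the Weir and Cockerham estimator, and that those components are already available. The vectors `num` and `den` are simulated stand-ins, and the function names are hypothetical. It does not reproduce the internals of hierfstat or assigner.

```r
# Sketch only: percentile bootstrap CI for a multi-locus Fst, resampling loci.
# `num` and `den` are simulated stand-ins for per-locus variance components
# (among-population component and total variance).
set.seed(1)
n_loci <- 500
den <- runif(n_loci, min = 0.05, max = 0.5)
num <- den * rnorm(n_loci, mean = 0.001, sd = 0.01)

# Multi-locus estimate as a ratio of sums (not a mean of per-locus ratios)
fst_ratio <- function(num, den) sum(num) / sum(den)

boot_fst_ci <- function(num, den, n_boot = 1000, level = 0.95) {
  stopifnot(length(num) == length(den))
  boot <- replicate(n_boot, {
    idx <- sample.int(length(num), replace = TRUE)  # loci sampled with replacement
    fst_ratio(num[idx], den[idx])
  })
  alpha <- (1 - level) / 2
  ci <- quantile(boot, probs = c(alpha, 1 - alpha), names = FALSE)
  c(FST = fst_ratio(num, den), CI_LOW = ci[1], CI_HIGH = ci[2])
}

res <- boot_fst_ci(num, den)
print(res)
```

Because whole loci are resampled, the interval reflects variation among loci rather than among individuals, which is why it can complement a permutation test that shuffles individuals.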
Allele frequencies across all sampled individuals were calculated to compare allele frequency spectra for major and minor alleles. In general, loci variable in < 5% of individuals are not considered informative at a population level.
Figure 1: Distribution of major and minor allele frequencies per locus across all individuals.
Loci fixed in > 95% of individuals were removed from the data set for Fst analysis and baseline assessment.
8844 loci (26106 alleles) were retained for further analysis.
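A frequency filter of this kind can be sketched as follows. The input format (a list with one vector of allele calls per locus) and the function names are assumptions made for illustration; the example works for loci with more than two alleles, as expected for haplotyped loci.

```r
# Sketch only: drop loci whose most common allele exceeds a frequency threshold.
# `alleles` is a hypothetical list with one character vector of allele calls
# per locus (two calls per diploid individual, NA = missing).
major_allele_freq <- function(calls) {
  counts <- table(calls)                  # table() drops NA by default
  if (length(counts) == 0) return(NA_real_)
  max(counts) / sum(counts)
}

filter_by_major_af <- function(alleles, max_major_af = 0.95) {
  maf <- vapply(alleles, major_allele_freq, numeric(1))
  alleles[!is.na(maf) & maf <= max_major_af]
}

alleles <- list(
  locus_1 = c(rep("A", 39), "B"),                        # major allele at 39/40
  locus_2 = c(rep("A", 30), rep("B", 8), NA, NA),        # major allele at 30/38
  locus_3 = c(rep("A", 20), rep("B", 15), rep("C", 5))   # major allele at 20/40
)

kept <- filter_by_major_af(alleles)
print(names(kept))
```

Here only `locus_1` exceeds the 95% threshold and is dropped; missing calls are excluded from the denominator.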
Table 2: Global Fst and bootstrapped 95%-confidence intervals (1,000 iterations, sampled with replacement) calculated according to Weir & Cockerham 1984 for individuals grouped by natal estuary.
| FST | N_MARKERS | CI_LOW | CI_HIGH |
|---|---|---|---|
| 0.0003434 | 8844 | 0.0001546 | 0.0005522 |
Individuals were permuted across estuaries to determine significance of global Fst.
Significant genetic heterogeneity was detected among estuaries along the Texas coast (p = 0.034).
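The logic of the permutation test can be sketched in base R. The statistic below is a deliberately simple stand-in (variance among group mean allele dosages, averaged over loci), not the Weir and Cockerham estimator, and the data are simulated. The p-value counts the observed value as one of the permutations, a common convention that is consistent with reported pairwise values such as 0.0059940 (= 6/1001).

```r
# Sketch only: permutation test for differentiation among groups.
# The statistic is a stand-in, NOT the Weir & Cockerham estimator.
set.seed(42)
n_ind  <- 60
n_loci <- 200
geno <- matrix(rbinom(n_ind * n_loci, size = 2, prob = 0.3),
               nrow = n_ind, ncol = n_loci)         # simulated allele dosages (0/1/2)
group <- rep(c("A", "B", "C"), each = n_ind / 3)    # hypothetical sampling groups

among_group_stat <- function(geno, group) {
  # rows of rowsum() and entries of table() are both sorted by group label
  group_means <- rowsum(geno, group) / as.vector(table(group))
  mean(apply(group_means, 2, var))
}

perm_test <- function(geno, group, n_perm = 1000) {
  obs  <- among_group_stat(geno, group)
  # shuffle group labels across individuals to build the null distribution
  null <- replicate(n_perm, among_group_stat(geno, sample(group)))
  p <- (sum(null >= obs) + 1) / (n_perm + 1)
  list(observed = obs, p_value = p)
}

res <- perm_test(geno, group)
print(res$p_value)
```

With 1,000 permutations the smallest attainable p-value under this convention is 1/1001.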
Table 3a: Pairwise Fst and bootstrapped 95%-CI (1,000 iterations, sampled with replacement) calculated according to Weir & Cockerham (1984).
| POP1 | POP2 | FST | CI_LOW | CI_HIGH |
|---|---|---|---|---|
| MAT | SA | 0.0014404 | 0.0003312 | 0.0026518 |
| ARA/CC | GAL | 0.0004158 | 0.0001902 | 0.0006375 |
| ARA/CC | MAT | 0.0004667 | 0.0000960 | 0.0008058 |
| ARA/CC | SL | 0.0003983 | 0.0000937 | 0.0007060 |
| ARA/CC | SA | 0.0003672 | 0.0000000 | 0.0013571 |
| GAL | MAT | 0.0003004 | 0.0000000 | 0.0006948 |
| GAL | SL | 0.0000000 | 0.0000000 | 0.0003085 |
| GAL | SA | 0.0001634 | 0.0000000 | 0.0012783 |
| MAT | SL | 0.0004224 | 0.0000000 | 0.0008971 |
| SL | SA | 0.0000000 | 0.0000000 | 0.0004046 |
Table 3b: Pairwise Fst among all pairs of estuaries
| ESTUARY | ARA/CC | GAL | MAT | SL | SA |
|---|---|---|---|---|---|
| ARA/CC | 0.0000000 | 0.0004158 | 0.0004667 | 0.0003983 | 0.0003672 |
| GAL | 0.0004158 | 0.0000000 | 0.0003004 | 0.0000000 | 0.0001634 |
| MAT | 0.0004667 | 0.0003004 | 0.0000000 | 0.0004224 | 0.0014404 |
| SL | 0.0003983 | 0.0000000 | 0.0004224 | 0.0000000 | 0.0000000 |
| SA | 0.0003672 | 0.0001634 | 0.0014404 | 0.0000000 | 0.0000000 |
Individuals were permuted among estuaries to calculate pairwise Fst (Weir & Cockerham 1984) and determine significance.
Table 3c: Pairwise Fst and permuted p-values (individuals shuffled across estuaries, 1,000 permutations)
| GRP1 | GRP2 | obsFst | PVAL |
|---|---|---|---|
| ARA/CC | GAL | 0.0004158 | 0.0059940 |
| ARA/CC | SL | 0.0003983 | 0.0289710 |
| MAT | SA | 0.0014404 | 0.0369630 |
| ARA/CC | MAT | 0.0004667 | 0.0409590 |
| SL | MAT | 0.0004224 | 0.0999001 |
| MAT | GAL | 0.0003004 | 0.1288711 |
| ARA/CC | SA | 0.0003672 | 0.2897103 |
| GAL | SA | 0.0001634 | 0.3956044 |
| SL | GAL | -0.0000047 | 0.4775225 |
| SL | SA | -0.0005735 | 0.7652348 |
Locus-specific Fst (individuals grouped by estuary).
Figure 2: Distribution of locus-specific Fst-values
Table 4a: Number of loci with Fst > 0.
| FST > 0 | n |
|---|---|
| FALSE | 5141 |
| TRUE | 3703 |
Table 4b: Number of loci with Fst > 0.01
| FST > 0.01 | n |
|---|---|
| FALSE | 7760 |
| TRUE | 1084 |
Table 5a: Pairwise Fst and bootstrapped 95% confidence intervals (1000 iterations, sampled with replacement) among regions calculated according to Weir & Cockerham 1984.
| POP1 | POP2 | FST | CI_LOW | CI_HIGH |
|---|---|---|---|---|
| South | North | 0.0003295 | 0.0001662 | 0.0005097 |
| South | Central | 0.0004879 | 0.0001426 | 0.0008450 |
| North | Central | 0.0002933 | 0.0000000 | 0.0006502 |
Table 5b: Pairwise Fst among all pairs of regions.
| REGION | South | North | Central |
|---|---|---|---|
| South | 0.0000000 | 0.0003295 | 0.0004879 |
| North | 0.0003295 | 0.0000000 | 0.0002933 |
| Central | 0.0004879 | 0.0002933 | 0.0000000 |
Individuals were permuted among regions to determine significance of pairwise Fst.
Table 5c: Significance of pairwise Fst between regions assessed by permuting individuals across regions (1,000 permutations)
| GRP1 | GRP2 | obsFst | PVAL |
|---|---|---|---|
| South | North | 0.0003295 | 0.0029970 |
| South | Central | 0.0004879 | 0.0259740 |
| North | Central | 0.0002933 | 0.1098901 |
Assess locus-specific Fst-values.
Figure 3: Distribution of locus-specific Fst-values for individuals grouped by geographic region
Table 6a: Number of loci with Fst > 0
| FST > 0 | n |
|---|---|
| FALSE | 5367 |
| TRUE | 3477 |
Table 6b: Number of loci with Fst > 0.01
| FST > 0.01 | n |
|---|---|
| FALSE | 8001 |
| TRUE | 843 |
Table 7a: Pairwise Fst and bootstrapped 95% confidence intervals between regions calculated according to Weir & Cockerham 1984.
| POP1 | POP2 | FST | CI_LOW | CI_HIGH |
|---|---|---|---|---|
| South | North | 0.0003043 | 0.000156 | 0.0004521 |
Individuals were permuted between regions to calculate pairwise Fst and determine significance.
Table 7b: Significance of pairwise Fst between North/South estuaries (1,000 permutations).
| COMP | GRP1 | GRP2 | obsFst | PVAL |
|---|---|---|---|---|
| South-North | South | North | 0.0003043 | 0.004995 |
Assess locus-specific Fst-values.
Table 8a: Number of loci with Fst > 0
| FST > 0 | n |
|---|---|
| FALSE | 5842 |
| TRUE | 3002 |
Table 8b: Number of loci with Fst > 0.01
| FST > 0.01 | n |
|---|---|
| FALSE | 8143 |
| TRUE | 701 |
Aging data from vertebrae were used to identify YOY (age 0) caught in each estuary.
Table 9a: Sample size per estuary (east to west - SL: Sabine Lake, GAL: Galveston Bay, MAT: Matagorda Bay, SA: San Antonio Bay, ARA/CC: Aransas & Corpus Christi Bay).
| POP | n |
|---|---|
| SL | 28 |
| GAL | 46 |
| MAT | 27 |
| SA | 6 |
| ARA/CC | 47 |
Table 9b: Sample size per region (North: SL, GAL; Central: MAT; South: SA, ARA/CC).
| REGION | n |
|---|---|
| North | 74 |
| Central | 27 |
| South | 53 |
Table 9c: Sample size per region (North: SL, GAL, MAT; South: SA, ARA/CC).
| REGION2 | n |
|---|---|
| North | 101 |
| South | 53 |
For purposes of baseline assessment, only YOY were retained in the data set to ensure that they were caught in their natal estuary. For genetic baseline assessment, all genotyped YOY (N = 154) were included; for assessment of microchemistry and combined data sets, YOY with both genetic and microchemistry data available were used.
The ability to assign individuals of unknown origin to source populations was evaluated by testing the robustness of baseline data sets for natal estuaries and regions, for a genetic data set and a combined data set (genetic and microchemistry data), using assignPOP (Chen et al. 2018 [Chen, Kuan Yu, Elizabeth A. Marschall, Michael G. Sovic, Anthony C. Fries, H. Lisle Gibbs, and Stuart A. Ludsin. 2018. “assignPOP: An r package for population assignment using genetic, non-genetic, or integrated data in a machine-learning framework.” Edited by Timothée Poisot. Methods in Ecology and Evolution 9 (2). Wiley/Blackwell (10.1111): 439–46. https://doi.org/10.1111/2041-210X.12897]), which uses a supervised machine-learning framework to evaluate the discriminatory power of baseline data.
The implemented Monte-Carlo cross-validation estimates the mean and variance of assignment accuracy by resampling a set of training individuals and loci to create a baseline and then determining how many test individuals are correctly assigned, which avoids bias due to self-assignment (Anderson 2010 [Anderson, E. C. 2010. “Assessing the power of informative subsets of loci for population assignment: Standard methods are upwardly biased.” Molecular Ecology Resources 10 (4). Wiley/Blackwell (10.1111): 701–10. https://doi.org/10.1111/j.1755-0998.2010.02846.x]; Waples 2010 [Waples, Robin S. 2010. “High-grading bias: Subtle problems with assessing power of selected subsets of loci for population assignment.” Wiley/Blackwell (10.1111). https://doi.org/10.1111/j.1365-294X.2010.04675.x]). Assignment ability may be affected by a lack of distinct differences among baseline groups, noisy data, or small data sets (< 20 – 50) that result in inaccurate estimates of allele frequencies.
Low variance loci are likely uninformative, and frequencies of rare alleles are more difficult to estimate accurately. Therefore, loci with a major allele frequency > 95% or > 5% missing data were removed and San Antonio Bay (n = 6 individuals) was not assessed for estuary comparisons, though San Antonio individuals were included in the southern regional baselines.
To eliminate bias associated with unbalanced population sizes (Puechmaille 2016 [Puechmaille, Sebastien J. 2016. “The program structure does not reliably recover the correct population structure when sampling is uneven: Subsampling and new estimators alleviate the problem.” Molecular Ecology Resources 16 (3). Wiley/Blackwell (10.1111): 608–27. https://doi.org/10.1111/1755-0998.12512]; Wang 2017 [Wang, Jinliang. 2017. “The computer program structure for assigning individuals to populations: easy to use but easier to misuse.” Molecular Ecology Resources 17 (5). Wiley/Blackwell (10.1111): 981–90. https://doi.org/10.1111/1755-0998.12650]), the same number of training individuals was drawn from each population for assignment tests of estuaries and regions, i.e. the number of training individuals was consistent but the number of (remaining) test individuals being assigned varied by baseline.
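The resampling scheme (Monte-Carlo cross-validation with an equal number of training individuals per group) can be sketched in base R. This is an illustration under stated assumptions rather than the assignPOP implementation: a nearest-centroid classifier stands in for the svm and lda models, the data are simulated, the group labels and sizes are hypothetical, and resampling of loci is omitted.

```r
# Sketch only: Monte-Carlo cross-validation with a balanced training draw.
# A nearest-centroid classifier stands in for svm/lda; data are simulated.
set.seed(7)
sizes  <- c(A = 28, B = 46, C = 27, D = 47)   # hypothetical, unequal group sizes
pop    <- rep(names(sizes), times = sizes)
n_feat <- 50
shift  <- matrix(rnorm(length(sizes) * n_feat, sd = 0.5), nrow = length(sizes),
                 dimnames = list(names(sizes), NULL))
x <- shift[pop, ] + matrix(rnorm(length(pop) * n_feat), nrow = length(pop))

mccv_once <- function(x, pop, n_train) {
  # draw the same number of training individuals from every group
  train <- unlist(lapply(split(seq_along(pop), pop), sample, size = n_train))
  test  <- setdiff(seq_along(pop), train)
  centroids <- rowsum(x[train, , drop = FALSE], pop[train]) / n_train
  # squared Euclidean distance of each test individual to each group centroid
  d <- sapply(rownames(centroids), function(g) {
    rowSums(sweep(x[test, , drop = FALSE], 2, centroids[g, ])^2)
  })
  pred <- colnames(d)[max.col(-d, ties.method = "first")]
  # per-group accuracy: proportion of test individuals assigned to their origin
  tapply(pred == pop[test], pop[test], mean)
}

# 30 iterations, each with a fresh random split into training and test sets
acc <- t(replicate(30, mccv_once(x, pop, n_train = 20)))
summary_tab <- round(rbind(mean = colMeans(acc), sd = apply(acc, 2, sd)) * 100, 1)
print(summary_tab)
```

Because test individuals are never part of the training set, accuracy is not inflated by self-assignment; the number of test individuals differs among groups (here 8, 26, 7, and 27) while the training size stays fixed.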
To test if subsets of highly informative loci have equal or higher discriminatory power, varying proportions of loci ranked by FST were used as training loci.
assignPOP uses a machine-learning framework to create predictive models, including linear discriminant analysis (lda), support vector machine (svm), naïve Bayes, decision tree, and random forest. To identify the best model for each data set, each combination of training individuals and loci was drawn 10 times in a preliminary analysis to calculate assignment accuracy overall and for individual baselines; the best combination of predictive model and proportion of loci was determined based on assignment accuracy and precision.
Initial comparisons indicated that the svm and lda models are most appropriate for the genetic, microchemistry, and combined data sets; therefore, only the results for these models are presented here.
The final assignment tests for the best model identified were based on 30 iterations, as recommended by Chen et al. (2018).
GENETIC DATA
Baseline assessments for estuaries were conducted using 8783 loci (25742 alleles) genotyped for 148 individuals.
Baselines for genetic data were established by randomly drawing 20 training individuals and a subset of the top 1%, 5%, 10%, 25%, 50%, 75%, 90%, and 100% of loci ranked by Fst and assigning the remaining individuals for 30 iterations for each combination of training individuals and loci.
    # Monte-Carlo cross-validation, estuary baselines (genetic data), svm model
    assign.MC(x = POPassign,
              train.inds = 20,
              train.loci = c(0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 1),
              loci.sample = "fst",
              iterations = 30,
              model = "svm",
              pca.method = "original",
              scaled = FALSE,
              dir = "results/estuary_svm/",
              multiprocess = TRUE,
              processors = 55)

    # lda model
    assign.MC(x = POPassign,
              train.inds = 20,
              train.loci = c(0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 1),
              loci.sample = "fst",
              iterations = 30,
              model = "lda",
              pca.method = "original",
              scaled = FALSE,
              dir = "results/estuary_lda/",
              multiprocess = TRUE,
              processors = 55)
Assignment accuracy for overall and individual nurseries was determined by evaluating the proportion of individuals successfully assigned back to their natal estuaries.
Compare combination of predictive model and proportion of loci used to describe estuary baselines using genetic data.
Figure 4: Assignment accuracy to estuary baselines using genetic data.
Table 10a: Mean +/- SD assignment accuracy (%) per estuary by proportion of Fst-ranked loci used for training (model: svm).
| ESTUARY | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|
| SL | 26.7 +/- 25.4 | 26.7 +/- 25.2 | 25 +/- 17.1 | 28.3 +/- 15.7 | 29.2 +/- 13.7 | 27.1 +/- 13.2 | 25.8 +/- 18 | 27.5 +/- 12.5 |
| GAL | 31.2 +/- 22.9 | 20.4 +/- 10.6 | 23.8 +/- 9.8 | 23.2 +/- 9.2 | 24.9 +/- 9.4 | 29.4 +/- 9.2 | 26.4 +/- 13.1 | 29.2 +/- 15.6 |
| MAT | 28.6 +/- 24 | 35.2 +/- 21.8 | 33.8 +/- 20.7 | 34.8 +/- 22.1 | 37.1 +/- 21.1 | 32.4 +/- 20.5 | 18.6 +/- 15.5 | 18.1 +/- 13.5 |
| ARA/CC | 16.2 +/- 16.1 | 26.5 +/- 12.2 | 28.1 +/- 11.6 | 28.4 +/- 8 | 28.9 +/- 10.2 | 29.6 +/- 11 | 27.4 +/- 14.3 | 16.9 +/- 13.3 |
| Overall-Est | 24.4 +/- 6.8 | 25.1 +/- 5.4 | 26.7 +/- 5.2 | 27.1 +/- 4.2 | 28.2 +/- 4.7 | 29.5 +/- 6 | 25.9 +/- 6.7 | 23 +/- 6.3 |
Table 10b: Mean +/- SD assignment accuracy (%) per estuary by proportion of Fst-ranked loci used for training (model: lda).
| ESTUARY | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|
| SL | 17.5 +/- 21.4 | 25 +/- 18.6 | 26.7 +/- 35.2 | 17.5 +/- 17.2 | 10.4 +/- 13.6 | 16.2 +/- 17.7 | 25.4 +/- 20.6 | 22.5 +/- 15.2 |
| GAL | 25 +/- 19.8 | 27.9 +/- 15.3 | 17.1 +/- 27 | 13.1 +/- 10.3 | 12.8 +/- 15.5 | 20.5 +/- 11.6 | 20.9 +/- 8.8 | 21.7 +/- 9.1 |
| MAT | 26.7 +/- 23.6 | 36.7 +/- 20.8 | 32.4 +/- 32.9 | 57.6 +/- 29.7 | 65.2 +/- 32.4 | 26.7 +/- 21.1 | 20.5 +/- 21.1 | 21 +/- 21.8 |
| ARA/CC | 25.8 +/- 21.2 | 22.1 +/- 12.4 | 30.6 +/- 36.7 | 20.2 +/- 13.7 | 23.1 +/- 23.3 | 38.9 +/- 15.3 | 38.1 +/- 17.9 | 39.8 +/- 17 |
| Overall-Est | 24.6 +/- 9.7 | 26.2 +/- 5.5 | 25.1 +/- 11.9 | 21 +/- 6.8 | 22 +/- 9.4 | 27.9 +/- 7.3 | 28.2 +/- 7.1 | 28.9 +/- 6.4 |
Predictive model svm/top 75% of loci ranked by Fst chosen as best model.
MICROCHEMISTRY DATA
Baselines for the microchemistry data set were established by randomly drawing 15 training individuals and assigning the remaining individuals for 30 iterations using all microchemistry variables.
    # Monte-Carlo cross-validation, estuary baselines (microchemistry data), svm model
    assign.MC(x = env,
              train.inds = 15,
              iterations = 30,
              model = "svm",
              dir = "results/chem_est_svm/",
              scaled = FALSE,
              pca.method = TRUE,
              multiprocess = TRUE,
              processors = 35)

    # lda model
    assign.MC(x = env,
              train.inds = 15,
              iterations = 30,
              model = "lda",
              dir = "results/chem_est_lda/",
              scaled = FALSE,
              pca.method = TRUE,
              multiprocess = TRUE,
              processors = 35)
A total of 250 individuals were used for baseline assessment using microchemistry data.
Assignment accuracy for overall and individual estuaries was determined by evaluating the proportion of test individuals successfully assigned back to their natal estuaries.
Compare which predictive model best describes estuary baselines using microchemistry data.
Figure 5: Assignment accuracy to estuaries (microchemistry data)
Table 11: Mean +/- SD correct assignment (%) to estuary of origin by predictive model (microchemistry data).
| ESTUARY | lda | svm |
|---|---|---|
| SL | 54.2 +/- 22.8 | 65 +/- 25.9 |
| GAL | 49.7 +/- 16.9 | 25.7 +/- 20.3 |
| MAT | 54.2 +/- 12.1 | 54.2 +/- 19.6 |
| ARA/CC | 77.1 +/- 13.6 | 72.1 +/- 17 |
| Overall-Est | 58.4 +/- 6.6 | 51.2 +/- 9 |
Predictive model lda chosen.
COMBINED DATA
Table 12: Number of samples per estuary for combined data set
| POP | n |
|---|---|
| ARA/CC | 23 |
| GAL | 25 |
| MAT | 26 |
| SL | 19 |
A total of 93 individuals were used for baseline assessment.
Baselines for the combined data set were established by randomly drawing 15 training individuals and a subset of the top 1%, 5%, 10%, 25%, 50%, 75%, and 100% of loci ranked by Fst, together with all microchemistry data, and assigning the remaining individuals for 30 iterations for each combination of training individuals and loci.
    # data not scaled ====
    # Monte-Carlo cross-validation, estuary baselines (combined data), svm model
    assign.MC(x = POPassignCON,
              train.inds = 15,
              train.loci = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1),
              loci.sample = "fst",
              iterations = 30,
              model = "svm",
              dir = "results/comb_est_svm/",
              scaled = FALSE,
              pca.method = "original",
              multiprocess = TRUE,
              processors = 35)

    # lda model
    assign.MC(x = POPassignCON,
              train.inds = 15,
              train.loci = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1),
              loci.sample = "fst",
              iterations = 30,
              model = "lda",
              dir = "results/comb_est_lda/",
              scaled = FALSE,
              pca.method = "original",
              multiprocess = TRUE,
              processors = 35)
Assignment accuracy was determined by assessing the proportion of test individuals correctly assigned to their natal estuary.
Compare combination of predictive model and proportion of loci used to describe estuary baselines.
Figure 6: Assignment accuracy to estuary baselines for combined data set
Table 13a: Mean +/- SD assignment accuracy (%) per estuary by proportion of Fst-ranked loci used for training (combined data; model: svm).
| ESTUARY | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 1 |
|---|---|---|---|---|---|---|---|
| SL | 25.8 +/- 29 | 32.5 +/- 23.8 | 46.7 +/- 26.9 | 34.2 +/- 23.2 | 35 +/- 29.1 | 27.5 +/- 31 | 20.8 +/- 24.6 |
| GAL | 41 +/- 19.5 | 37 +/- 17.6 | 33.7 +/- 15.2 | 42.3 +/- 16.1 | 43.7 +/- 13.3 | 44.7 +/- 22.9 | 44.7 +/- 22.1 |
| MAT | 46.4 +/- 19.9 | 39.1 +/- 14.6 | 44.8 +/- 17.2 | 40.9 +/- 17.5 | 45.8 +/- 18.3 | 44.5 +/- 18.4 | 43.6 +/- 12.9 |
| ARA/CC | 58.3 +/- 22.6 | 63.3 +/- 15.4 | 65.8 +/- 19.1 | 66.7 +/- 12.4 | 71.7 +/- 14.7 | 73.3 +/- 16 | 74.6 +/- 15.2 |
| Overall-Est | 45.2 +/- 8.2 | 43.5 +/- 8.8 | 46.8 +/- 7.2 | 46.8 +/- 7.7 | 50.1 +/- 8.5 | 49.5 +/- 8.6 | 48.7 +/- 8.7 |
Table 13b: Mean +/- SD assignment accuracy (%) per estuary by proportion of Fst-ranked loci used for training (combined data; model: lda).
| ESTUARY | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 1 |
|---|---|---|---|---|---|---|---|
| SL | 26.7 +/- 22.7 | 14.2 +/- 17 | 25 +/- 22.7 | 21.7 +/- 28.4 | 22.5 +/- 28.1 | 40.8 +/- 28.2 | 42.5 +/- 19.9 |
| GAL | 40.7 +/- 23.3 | 35.7 +/- 26.7 | 33.7 +/- 20.6 | 22.7 +/- 24.3 | 32.7 +/- 28.3 | 44 +/- 12.8 | 40 +/- 15.3 |
| MAT | 40 +/- 19.9 | 31.2 +/- 23 | 40.9 +/- 19.7 | 28.8 +/- 27.4 | 20.9 +/- 24.2 | 41.5 +/- 12.3 | 48.2 +/- 13.3 |
| ARA/CC | 49.2 +/- 22.7 | 33.3 +/- 23.7 | 58.3 +/- 14.8 | 75.8 +/- 24.6 | 83.8 +/- 18.6 | 79.2 +/- 13.7 | 77.9 +/- 13.4 |
| Overall-Est | 40.8 +/- 9.1 | 31 +/- 6.9 | 41 +/- 7.5 | 37.5 +/- 7.9 | 39.9 +/- 10.8 | 51.3 +/- 5.3 | 52.2 +/- 5.6 |
Predictive model lda/all loci chosen as best model.
GENETIC DATA
A total of 154 individuals were used for baseline assessment.
Regional baselines for genetic data were established by randomly drawing 20 training individuals and a subset of the top 1%, 5%, 10%, 25%, 50%, 75%, 90%, and 100% of loci ranked by Fst and assigning the remaining individuals for 30 iterations for each combination of training individuals and loci.
    # Monte-Carlo cross-validation, regional baselines (genetic data), svm model
    assign.MC(x = POPassign,
              train.inds = 20,
              train.loci = c(0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 1),
              loci.sample = "fst",
              iterations = 30,
              model = "svm",
              pca.method = "original",
              scaled = FALSE,
              dir = "results/region1_svm/",
              multiprocess = TRUE,
              processors = 35)

    # lda model
    assign.MC(x = POPassign,
              train.inds = 20,
              train.loci = c(0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 1),
              loci.sample = "fst",
              iterations = 30,
              model = "lda",
              pca.method = "original",
              scaled = FALSE,
              dir = "results/region1_lda/",
              multiprocess = TRUE,
              processors = 35)
Assignment accuracy was evaluated by assessing the proportion of individuals successfully assigned back to their region of origin.
## Correct assignment rates were estimated!!
## A total of 240 assignment tests for 3 pops.
## Results were also saved in a 'Rate_of_240_tests_3_pops.txt' file in the directory.
Compare combination of predictive model and proportion of loci used to describe regional baselines.
Figure 7: Assignment accuracy to regional baselines using genetic data
Table 14a: Mean +/- SD assignment accuracy (%) per region by proportion of Fst-ranked loci used for training (model: svm).
| REGION | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|
| North | 29.8 +/- 10.9 | 30.7 +/- 12.9 | 30.9 +/- 8.2 | 30.7 +/- 8.4 | 31 +/- 8.3 | 34.5 +/- 9.1 | 42.5 +/- 12.4 | 35.6 +/- 12.3 |
| Central | 34.3 +/- 20.1 | 34.3 +/- 23 | 39.5 +/- 21.8 | 39.5 +/- 19.4 | 40 +/- 16.5 | 37.6 +/- 18.6 | 27.1 +/- 23.2 | 30.5 +/- 17.9 |
| South | 39.3 +/- 15.2 | 39.3 +/- 14.5 | 33.5 +/- 11.6 | 33.3 +/- 10.4 | 32.8 +/- 10.5 | 36.3 +/- 10.8 | 36.4 +/- 17.5 | 31.3 +/- 13.7 |
| Overall-reg3 | 33.4 +/- 5.4 | 34 +/- 7.1 | 32.4 +/- 5.3 | 32.3 +/- 4.6 | 32.3 +/- 5.7 | 35.4 +/- 5.1 | 39.2 +/- 7.9 | 33.7 +/- 8 |
Table 14b: Mean +/- SD assignment accuracy (%) per region by proportion of Fst-ranked loci used for training (model: lda).
| REGION | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|
| North | 31.4 +/- 8.9 | 25.4 +/- 21.4 | 18.9 +/- 22.3 | 16.2 +/- 14.9 | 9.8 +/- 14.1 | 15.4 +/- 10.2 | 25.2 +/- 15.6 | 22.3 +/- 13.5 |
| Central | 37.1 +/- 23 | 47.1 +/- 28.4 | 38.1 +/- 35.5 | 57.6 +/- 34.5 | 63.8 +/- 35.7 | 48.1 +/- 29.4 | 32.4 +/- 20.2 | 33.3 +/- 22.3 |
| South | 34.3 +/- 12 | 30 +/- 15.1 | 44 +/- 34.6 | 32.9 +/- 28.5 | 29.8 +/- 34.2 | 42.5 +/- 22.1 | 42 +/- 19.5 | 42.2 +/- 21.3 |
| Overall-reg3 | 32.8 +/- 5.5 | 28.7 +/- 11.2 | 29.1 +/- 12.6 | 25.2 +/- 12.4 | 20.9 +/- 13.4 | 27.3 +/- 9.3 | 31.6 +/- 9.1 | 30.1 +/- 9.2 |
The svm model built using 20 training individuals and the top 90% of loci ranked by Fst produced the highest assignment accuracy.
MICROCHEMISTRY DATA
A total of 250 individuals were used for baseline assessment.
Baselines for the microchemistry data set were established by randomly drawing 20 training individuals and assigning the remaining individuals for 30 iterations (all microchemistry variables used).
    # Monte-Carlo cross-validation, regional baselines (microchemistry data), svm model
    assign.MC(x = env,
              train.inds = 20,
              iterations = 30,
              model = "svm",
              dir = "results/chem_reg_svm/",
              scaled = FALSE,
              pca.method = TRUE,
              multiprocess = TRUE,
              processors = 35)

    # lda model
    assign.MC(x = env,
              train.inds = 20,
              iterations = 30,
              model = "lda",
              dir = "results/chem_reg_lda/",
              scaled = FALSE,
              pca.method = TRUE,
              multiprocess = TRUE,
              processors = 35)
Assignment probabilities for microchemistry data set for 30 iterations.
## Correct assignment rates were estimated!!
## A total of 30 assignment tests for 3 pops.
## Results were also saved in a 'Rate_of_30_tests_3_pops.txt' file in the directory.
Compare predictive models to determine which best describes regional baselines.
Figure 8: Assignment accuracy to regional baselines established using microchemistry data (all variables)
Table 15: Mean +/- SD assignment accuracy (%) to region of origin by predictive model (microchemistry data).
| REGION | lda | svm |
|---|---|---|
| North | 74.4 +/- 7.7 | 58.9 +/- 14.3 |
| Central | 62.2 +/- 21.4 | 45.6 +/- 27.3 |
| South | 79.6 +/- 14.9 | 72.1 +/- 16 |
| Overall-reg3 | 73.6 +/- 5.9 | 59.6 +/- 8.8 |
Predictive model lda chosen as best model.
COMBINED DATA
Table 16: Number of individuals per region with microchemistry data and genetic data
| REGION | n |
|---|---|
| South | 28 |
| North | 44 |
| Central | 26 |
A total of 98 individuals were used for baseline assessment.
Baselines were established by randomly drawing 20 training individuals and the top 1%, 5%, 10%, 25%, 50%, 75%, and 100% of loci ranked by Fst, and assigning the remaining individuals for 30 iterations.
    # Monte-Carlo cross-validation, regional baselines (combined data), svm model
    assign.MC(x = POPassignCON,
              train.inds = 20,
              train.loci = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1),
              loci.sample = "fst",
              iterations = 30,
              model = "svm",
              dir = "results/comb_region1_svm/",
              scaled = FALSE,
              pca.method = "original",
              multiprocess = TRUE,
              processors = 35)

    # lda model
    assign.MC(x = POPassignCON,
              train.inds = 20,
              train.loci = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1),
              loci.sample = "fst",
              iterations = 30,
              model = "lda",
              dir = "results/comb_region1_lda/",
              scaled = FALSE,
              pca.method = "original",
              multiprocess = TRUE,
              processors = 35)
Assignment accuracy of baselines was assessed as the proportion of individuals successfully assigned back to their natal regions.
Compare combination of predictive model and proportion of loci used to describe regional baselines.
Figure 9: Assignment accuracy to regional baselines (combined data set)
Table 17a: Mean +/- SD assignment accuracy (%) per region by proportion of Fst-ranked loci used for training (combined data; model: svm).
| REGION | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 1 |
|---|---|---|---|---|---|---|---|
| North | 59.2 +/- 15.9 | 56.1 +/- 11.6 | 66 +/- 12.2 | 66.7 +/- 10.6 | 70.8 +/- 9.7 | 71.2 +/- 9 | 71.5 +/- 9.2 |
| Central | 50.6 +/- 20.8 | 45.6 +/- 23.1 | 52.8 +/- 21.5 | 58.9 +/- 17.9 | 53.9 +/- 19.4 | 51.7 +/- 18.7 | 51.7 +/- 18.7 |
| South | 69.2 +/- 17.6 | 75 +/- 15.7 | 80 +/- 14.9 | 81.7 +/- 13.4 | 81.2 +/- 14.2 | 82.5 +/- 14.5 | 82.9 +/- 14.1 |
| Overall-reg3 | 59.9 +/- 9.2 | 58.4 +/- 8.6 | 66.8 +/- 6.9 | 68.6 +/- 6.4 | 70.4 +/- 6.2 | 70.5 +/- 5.5 | 70.8 +/- 5.6 |
Table 17b: Mean +/- SD assignment accuracy (%) per region by proportion of Fst-ranked loci used for training (combined data; model: lda).
| REGION | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 1 |
|---|---|---|---|---|---|---|---|
| North | 52.3 +/- 15 | 34 +/- 19.3 | 48.5 +/- 16.9 | 48.9 +/- 21.4 | 53.1 +/- 23 | 72.8 +/- 10.7 | 68.9 +/- 9.1 |
| Central | 51.9 +/- 19.3 | 38.6 +/- 23.5 | 37.1 +/- 22.1 | 35.2 +/- 21.8 | 43.8 +/- 31.6 | 43.8 +/- 20.2 | 55.2 +/- 20.8 |
| South | 56.7 +/- 15.3 | 46.3 +/- 21.1 | 63.7 +/- 13.4 | 72.2 +/- 20 | 74.1 +/- 27.7 | 80.4 +/- 10.4 | 77 +/- 11.6 |
| Overall-reg3 | 53.2 +/- 10.1 | 37.5 +/- 12.2 | 49.9 +/- 9.5 | 51.7 +/- 11.9 | 56.1 +/- 15.3 | 69.5 +/- 7.8 | 68.4 +/- 6.2 |
The svm model built using 20 training individuals and all loci was chosen as the best model.
GENETIC DATA
Baselines for genetic data were established by randomly drawing training individuals (30 for the svm model, 45 for the lda model) and a subset of the top 1%, 5%, 10%, 25%, 50%, 75%, 90%, and 100% of loci ranked by Fst, then assigning the remaining individuals, for 30 iterations.
assign.MC(x = POPassign,
train.inds = 30,
train.loci = c(0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 1),
loci.sample = "fst",
iterations = 30,
model="svm",
pca.method = "original",
scaled = FALSE,
dir="results/region2_svm/",
multiprocess = TRUE,
processors = 35)
assign.MC(x = POPassign,
train.inds = 45,
train.loci = c(0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 1),
loci.sample = "fst",
iterations = 30,
model="lda",
pca.method = "original",
scaled = FALSE,
dir="results/region2_lda/",
multiprocess = TRUE,
processors = 35)
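A minimal sketch of what choosing the top proportion of loci ranked by Fst amounts to. The Fst values, the locus names, and the helper `top_loci` are hypothetical, and rounding the number of retained loci is an assumption rather than a description of how `assign.MC` handles it internally.

```r
# Hypothetical per-locus Fst values (illustrative only, not from this data set)
fst <- c(loc1 = 0.002, loc2 = 0.031, loc3 = 0.010, loc4 = 0.054,
         loc5 = 0.000, loc6 = 0.018, loc7 = 0.007, loc8 = 0.042,
         loc9 = 0.001, loc10 = 0.025)

# Return the names of the top `prop` fraction of loci, highest Fst first
# (assumption: the number of loci kept is rounded, with at least one retained)
top_loci <- function(fst, prop) {
  n_keep <- max(1, round(length(fst) * prop))
  names(sort(fst, decreasing = TRUE))[seq_len(n_keep)]
}

# One locus subset per proportion, analogous to the train.loci argument
train_loci <- c(0.1, 0.5, 1)
subsets <- lapply(train_loci, top_loci, fst = fst)
names(subsets) <- train_loci

subsets[["0.5"]]
```

Each subset contains the smaller ones, so any change in accuracy between proportions reflects the loci added at lower Fst ranks.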
Efficiency of baselines was assessed as the proportion of individuals successfully assigned back to their natal regions.
Compare combinations of predictive models and proportion of loci ranked by Fst used to describe regional baselines.
Figure 10: Assignment accuracy to regional baselines (genetic data set)
Table 18a: Mean +/- std assignment accuracy (%) by proportion of loci used (model: svm)
| REGION | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|
| North | 48.8 +/- 14.2 | 50.3 +/- 13.3 | 49.9 +/- 11.1 | 50.6 +/- 12.8 | 50.4 +/- 11.2 | 51.4 +/- 12 | 51.3 +/- 12.2 | 51 +/- 12.5 |
| South | 48.6 +/- 9.6 | 50.5 +/- 8.9 | 49.7 +/- 8.1 | 48.2 +/- 7.4 | 50.8 +/- 8.8 | 49.7 +/- 6.5 | 53.1 +/- 8.2 | 49.2 +/- 10.7 |
| Overall-reg2 | 48.7 +/- 6.4 | 50.5 +/- 5.7 | 49.7 +/- 5.2 | 48.8 +/- 5.5 | 50.7 +/- 6.1 | 50.1 +/- 4.3 | 52.7 +/- 5 | 49.6 +/- 7 |
Table 18b: Mean +/- std assignment accuracy (%) by proportion of loci used (model: lda)
| REGION | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|
| North | 48.6 +/- 8.3 | 49.3 +/- 6.8 | 51.7 +/- 6.3 | 54.6 +/- 8.1 | 58.6 +/- 9.4 | 49.2 +/- 10.6 | 41.7 +/- 8.5 | 38.9 +/- 8.7 |
| South | 50 +/- 20 | 47.1 +/- 19.6 | 43.8 +/- 17 | 49.6 +/- 15.2 | 55 +/- 19.6 | 60.4 +/- 18 | 59.6 +/- 17.9 | 62.1 +/- 17.8 |
| Overall-reg2 | 48.8 +/- 7.5 | 49 +/- 5.9 | 50.7 +/- 5.2 | 54 +/- 6.6 | 58.2 +/- 7 | 50.6 +/- 8.6 | 43.9 +/- 6.7 | 41.8 +/- 7.1 |
Assignment accuracy was highest for the model built using lda and the top 50% of loci ranked by Fst.
MICROCHEMISTRY DATA
Monte-Carlo cross-validation was used for the assignment test to evaluate whether the microchemistry data set has sufficient discriminatory power.
Baselines for the microchemistry data set were established by randomly drawing 20 training individuals and assigning the remaining individuals for 30 iterations.
assign.MC(x = env,
train.inds = 20,
iterations = 30,
model="svm",
dir="results/chem_reg2_svm/",
scaled = FALSE,
pca.method = TRUE,
multiprocess = TRUE,
processors = 35)
assign.MC(x = env,
train.inds = 20,
iterations = 30,
model="lda",
dir="results/chem_reg2_lda/",
scaled = FALSE,
pca.method = TRUE,
multiprocess = TRUE,
processors = 35)
Efficiency of baselines was assessed as the proportion of test individuals correctly assigned to their natal region.
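The resampling scheme can be sketched in a few lines of R using `MASS::lda`. This is not the assignPOP implementation: the data are simulated, and the column names `x1` and `x2` and the helper `mc_cv` are hypothetical.

```r
library(MASS)  # provides lda(); a recommended package shipped with R

set.seed(42)

# Simulated two-region data set with two continuous variables (hypothetical)
n_per_pop <- 40
dat <- data.frame(
  region = factor(rep(c("North", "South"), each = n_per_pop)),
  x1 = c(rnorm(n_per_pop, mean = 0), rnorm(n_per_pop, mean = 2)),
  x2 = c(rnorm(n_per_pop, mean = 0), rnorm(n_per_pop, mean = 1))
)

# Monte-Carlo cross-validation: one row of per-region accuracies per iteration
mc_cv <- function(dat, train_inds = 20, iterations = 30) {
  t(replicate(iterations, {
    # Randomly draw `train_inds` training individuals from each region
    train_idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$region),
                               sample, size = train_inds))
    fit  <- lda(region ~ x1 + x2, data = dat[train_idx, ])
    test <- dat[-train_idx, ]
    pred <- predict(fit, newdata = test)$class
    # Proportion of test individuals assigned back to their own region
    sapply(split(pred == test$region, test$region), mean)
  }))
}

rates <- mc_cv(dat)
round(100 * colMeans(rates), 1)      # mean accuracy per region (%)
round(100 * apply(rates, 2, sd), 1)  # std across iterations (%)
```

The mean and std across the rows of `rates` correspond to the "mean +/- std" entries reported in the accuracy tables.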
##
## Correct assignment rates were estimated!!
## A total of 30 assignment tests for 2 pops.
## Results were also saved in a 'Rate_of_30_tests_2_pops.txt' file in the directory.
## Correct assignment rates were estimated!!
## A total of 30 assignment tests for 2 pops.
## Results were also saved in a 'Rate_of_30_tests_2_pops.txt' file in the directory.
Compare assignment accuracy among predictive models.
Figure 11: Assignment accuracy to regional baselines (all microchemistry variables)
Table 19: Mean +/- std assignment accuracy (%) by model (lda and svm)
| REGION | lda | svm |
|---|---|---|
| North | 83.7 +/- 6.2 | 72 +/- 16.2 |
| South | 85 +/- 13.7 | 77.9 +/- 12.1 |
| Overall-reg2 | 83.9 +/- 5.2 | 72.8 +/- 13.7 |
The lda model was chosen as the best predictive model.
COMBINED DATA
Baselines were created using 20 individuals and 1, 5, 10, 25, 50, 75, and 100% of loci ranked by Fst combined with microchemistry data for 30 iterations.
assign.MC(x = POPassignCON,
train.inds = 20,
train.loci = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1),
loci.sample = "fst",
iterations = 30,
model="svm",
dir="results/comb_region2_svm/",
scaled = FALSE,
pca.method = "original",
multiprocess = TRUE,
processors = 35)
assign.MC(x = POPassignCON,
train.inds = 20,
train.loci = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1),
loci.sample = "fst",
iterations = 30,
model="lda",
dir="results/comb_region2_lda/",
scaled = FALSE,
pca.method = "original",
multiprocess = TRUE,
processors = 35)
Efficiency of baselines was determined by assessing the proportion of individuals assigned back to their region of origin.
Compare combinations of predictive model and proportion of loci used to describe regional baselines.
Figure 12: Assignment accuracy of regional baselines for combined data set
Table 20a: Mean +/- std assignment accuracy (%) by proportion of loci used (model: svm)
| REGION | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 1 |
|---|---|---|---|---|---|---|---|
| North | 69.1 +/- 15.3 | 77.4 +/- 12.3 | 77.1 +/- 9.8 | 80.3 +/- 7 | 81.5 +/- 7.7 | 80.7 +/- 7.6 | 81.4 +/- 7.3 |
| South | 76.2 +/- 14.1 | 81.7 +/- 14.6 | 81.7 +/- 12.2 | 82.9 +/- 13.3 | 78.8 +/- 15.1 | 79.6 +/- 16.6 | 79.6 +/- 14.9 |
| Overall-reg2 | 70.1 +/- 12.6 | 78 +/- 10.2 | 77.8 +/- 8.1 | 80.6 +/- 5.8 | 81.1 +/- 6.8 | 80.5 +/- 6.1 | 81.1 +/- 6 |
Table 20b: Mean +/- std assignment accuracy (%) by proportion of loci used (model: lda)
| REGION | 0.01 | 0.05 | 0.1 | 0.25 | 0.5 | 0.75 | 1 |
|---|---|---|---|---|---|---|---|
| North | 65.9 +/- 11.2 | 61.5 +/- 13.2 | 72.9 +/- 6.8 | 77.4 +/- 10 | 85.3 +/- 8.3 | 89.1 +/- 5.1 | 90.3 +/- 5.1 |
| South | 61.7 +/- 16.7 | 65 +/- 18.4 | 70.8 +/- 15.5 | 72.9 +/- 14 | 82.5 +/- 13.8 | 83.8 +/- 12.8 | 82.5 +/- 11.7 |
| Overall-reg2 | 65.3 +/- 10 | 62 +/- 11.8 | 72.6 +/- 5.8 | 76.8 +/- 8.5 | 84.9 +/- 7.7 | 88.4 +/- 4.9 | 89.3 +/- 4.9 |
Assignment accuracy was highest for baselines created using the lda model and all loci.
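The choice can be checked directly from the Overall-reg2 rows of Tables 20a and 20b. The sketch below ranks combinations by mean accuracy only and ignores the spread across iterations; the object names are illustrative.

```r
# Overall-reg2 mean accuracies (%) copied from Tables 20a (svm) and 20b (lda)
prop_loci <- c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1)
overall <- rbind(
  svm = c(70.1, 78.0, 77.8, 80.6, 81.1, 80.5, 81.1),
  lda = c(65.3, 62.0, 72.6, 76.8, 84.9, 88.4, 89.3)
)
colnames(overall) <- prop_loci

# Locate the model / proportion-of-loci combination with the highest mean
best <- which(overall == max(overall), arr.ind = TRUE)
best_model <- rownames(overall)[best[1, "row"]]
best_prop  <- prop_loci[best[1, "col"]]

c(model = best_model, prop_loci = best_prop)
```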
Compare assignment accuracy for genetics, microchemistry and combined data sets for baselines grouping individuals by estuary and three vs. two regions.
Figure 13: Assignment accuracy to baselines using combined data set
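For the two-region baselines, the best overall accuracies reported in Tables 18b, 19, and 20b can be put side by side. A small sketch, with illustrative object names:

```r
# Best Overall-reg2 mean accuracy (%) per data set:
# genetic (lda, top 50% of loci), microchemistry (lda), combined (lda, all loci)
best_overall <- c(genetic = 58.2, microchemistry = 83.9, combined = 89.3)

# Gain of the combined data set over each single data set (percentage points)
gain <- best_overall[["combined"]] - best_overall[c("genetic", "microchemistry")]
round(gain, 1)
```

With two regions, random assignment would be expected to be correct roughly half the time if both regions are about equally represented among test individuals, which puts the genetic-only accuracies close to chance.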